NSF PAR Search | NSF Public Access Repository

A goodness‐of‐fit test for regression models with discrete outcomes

https://doi.org/10.1002/cjs.70046

Yang, Lu; Genest, Christian; Nešlehová, Johanna G (March 2026, Canadian Journal of Statistics)

Abstract Regression models are often used to analyze discrete outcomes, but classical goodness‐of‐fit tests such as those based on the deviance or Pearson's statistic can be misleading or have little power in this context. To address this issue, we propose a new test, inspired by the work of Czado et al. (Biometrics, 65(4):1254–1261, 2009), which involves no randomization, tuning parameter, or binning of covariates. The statistic's large‐sample distribution under the null hypothesis is determined; as it involves unknown parameter values, one must resort to a bootstrap procedure to compute p‐values. Simulations are conducted to investigate the ability of the test to detect a broad range of model misspecifications commonly seen in practice. The proposed procedure is seen to perform well in all the scenarios considered as well as on real data.
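The abstract above says that p‐values are obtained by a bootstrap procedure. A minimal sketch of how a bootstrap p‐value is commonly computed is given here for illustration only. The function name `bootstrap_p_value`, the callback `simulate_statistic`, and the "+1" correction are assumptions of this sketch; the paper's actual test statistic and resampling scheme are not described on this page.

```python
import random


def bootstrap_p_value(observed, simulate_statistic, n_boot=999, seed=0):
    """Estimate a p-value by comparing an observed statistic with
    statistics simulated under the fitted null model.

    observed           -- test statistic computed on the real data
    simulate_statistic -- hypothetical callback: takes a random.Random
                          instance and returns one statistic computed
                          on data simulated under the null
    n_boot             -- number of bootstrap replicates
    """
    rng = random.Random(seed)
    # Count replicates at least as extreme as the observed value
    # (large values of the statistic are treated as evidence against the null).
    exceed = sum(simulate_statistic(rng) >= observed for _ in range(n_boot))
    # The "+1" terms are a common convention that keeps the estimate
    # strictly positive; this is an assumption, not the paper's formula.
    return (1 + exceed) / (n_boot + 1)
```

In practice `simulate_statistic` would refit the model to each simulated data set before recomputing the statistic, which is what accounts for the unknown parameter values mentioned in the abstract.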
Full Text Available
A copula model for marked point process with a terminal event: An application in dynamic prediction of insurance claims

https://doi.org/10.1214/24-AOAS1902

Yang, Lu; Shi, Peng; Huang, Shimeng (December 2024, The Annals of Applied Statistics)

Accurate prediction of an insurer’s outstanding liabilities is crucial for maintaining the financial health of the insurance sector. We aim to develop a statistical model for insurers to dynamically forecast unpaid losses by leveraging the granular transaction data on individual claims. The liability cash flow from a single insurance claim is determined by an event process that describes the recurrences of payments, a payment process that generates a sequence of payment amounts, and a settlement process that terminates both the event and payment processes. More importantly, the three components are dependent on one another, which enables the dynamic prediction of an insurer’s outstanding liability. We introduce a copula-based point process framework to model the recurrent events of payment transactions from an insurance claim, where the longitudinal payment amounts and the time-to-settlement outcome are formulated as the marks and the terminal event of the counting process, respectively. The dependencies among the three components are characterized using the method of pair copula constructions. We further develop a stagewise strategy for parameter estimation and illustrate its desirable properties with numerical experiments. In the application we consider a portfolio of property insurance claims for building and contents coverage obtained from a commercial property insurance provider, where we find intriguing dependence patterns among the three components. The superior dynamic prediction performance of the proposed joint model enhances the insurer’s decision-making in claims reserving and risk financing operations.
Full Text Available
Human selection maintains karyotype integrity of highly unstable genomic cultivated autotetraploid potato (Solanum tuberosum)

https://doi.org/10.1126/sciadv.aea5207

Yang, Lu; Bai, Yanbo; Chao, Getu; Zhao, Kanglu; Xu, Junxiong; Sun, Meiping; Zhang, Hong; Niu, Yaqingqing; Li, Kexin; Xiong, Zinan; et al (December 2025, Science Advances)

Understanding meiotic and genomic stability in polyploid species is critical for advancing genetic discovery and breeding. Potato (Solanum tuberosum), the third most important food crop globally, is represented by cultivated potato, an autotetraploid with highly heterozygous genomes. Here, we revealed that cultivated potato showed different chromosomal pairing configurations and irregular chromosomal segregations, resulting in a high proportion of aneuploid gametes. Aneuploidy was identified in inbreeding and outbreeding populations of cultivated potato, with frequencies ranging from 14.8 to 24.0%, indicating notable genomic instability. Extensive novel copy number variations (CNVs) were detected in the progeny, which may increase genetic diversity. Molecular karyotyping of 50 commercial varieties revealed that all varieties were euploid, with significantly fewer CNVs, indicating that human selection maintains karyotypic integrity. Aneuploids in the outbreeding population exhibited poor agronomic traits and fitness defects, which demonstrated that genomic instability increases phenotypic diversity. Our study provides insights into genetic basis and phenotypic plasticity of cultivated potato, offering guidance for future breeding strategies.
Full Text Available
Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes

https://doi.org/10.1080/10618600.2024.2303336

Yang, Lu (February 2024, Journal of Computational and Graphical Statistics)

The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for regression models with general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (or a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of QQ plots can further help identify possible causes of misspecification such as overdispersion. We provide theoretical justification for the proposed residuals by establishing their asymptotic properties. Moreover, in order to assess the mean structure and identify potential covariates, we develop an ordered curve as a supplementary tool, which is based on the comparison between the partial sum of outcomes and of fitted means. Through simulation, we demonstrate empirically that the proposed tools outperform commonly used residuals for various model assessment tasks. We also illustrate the workflow of model assessment using the proposed tools in data analysis. Supplementary materials for this article are available online.
Full Text Available
Diagnostics for regression models with semicontinuous outcomes

https://doi.org/10.1093/biomtc/ujae007

Yang, Lu (January 2024, Biometrics)

Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
Full Text Available
Two-Dimensional Electrically Conductive Metal–Organic Framework Boosts Synaptic Plasticity for Dynamic Image Refresh, Classification, and Efferent Neuromuscular Systems

https://doi.org/10.1021/acs.nanolett.4c04650

Wei, Huanhuan; Liu, Jiaqi; Ni, Yao; Hu, Xuanxin; Lv, Xiuliang; Yang, Lu; He, Gang; Xu, Zhipeng; Gong, Jiangdong; Jiang, Chengpeng; et al (December 2024, Nano Letters)

Full Text Available
2dGBH: Two-dimensional group Benjamini-Hochberg procedure for false discovery rate control in Two-Way multiple testing of genomic data

https://doi.org/10.1093/bioinformatics/btae035

Yang, Lu; Wang, Pei; Chen, Jun (January 2024, Bioinformatics)
Schwartz, Russell (Ed.)
Abstract Motivation Emerging omics technologies have introduced a two-way grouping structure in multiple testing, as seen in single-cell omics data, where the features can be grouped by either genes or cell types. Traditional multiple testing methods have limited ability to exploit such two-way grouping structure, leading to potential power loss. Results We propose a new two-dimensional Group Benjamini-Hochberg (2dGBH) procedure to harness the two-way grouping structure in omics data, extending the traditional one-way adaptive GBH procedure. Using both simulated and real datasets, we show that 2dGBH effectively controls the false discovery rate across biologically relevant settings, and it is more powerful than the BH or q-value procedure and more robust than the one-way adaptive GBH procedure. Availability and implementation 2dGBH is available as an R package at: https://github.com/chloelulu/tdGBH. The analysis code and data are available at: https://github.com/chloelulu/tdGBH-paper. Supplementary information Supplementary data are available at Bioinformatics online.
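For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of the standard (ungrouped) Benjamini-Hochberg step-up procedure. It is not the 2dGBH method and not the API of the R package linked above; the function name `benjamini_hochberg` and its signature are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, one per input p-value, that is True
    where the corresponding hypothesis is rejected at FDR level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Step-up rule: reject every hypothesis ranked at or below k_max,
    # including ones whose own p-value exceeded its own threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```

Grouped variants such as the one-way GBH procedure named in the abstract reweight the p-values by group before applying a rule of this kind; that reweighting is not shown here.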
Full Text Available
Robust Differential Abundance Analysis of Microbiome Sequencing Data

https://doi.org/10.3390/genes14112000

Li, Guanxun; Yang, Lu; Chen, Jun; Zhang, Xianyang (November 2023, Genes)

It is well known that the microbiome data are ridden with outliers and have heavy distribution tails, but the impact of outliers and heavy-tailedness has yet to be examined systematically. This paper investigates the impact of outliers and heavy-tailedness on differential abundance analysis (DAA) using the linear models for the differential abundance analysis (LinDA) method and proposes effective strategies to mitigate their influence. The presence of outliers and heavy-tailedness can significantly decrease the power of LinDA. We investigate various techniques to address outliers and heavy-tailedness, including generalizing LinDA into a more flexible framework that allows for the use of robust regression and winsorizing the data before applying LinDA. Our extensive numerical experiments and real-data analyses demonstrate that robust Huber regression has overall the best performance in addressing outliers and heavy-tailedness.
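Winsorizing, one of the strategies named in the abstract above, can be sketched as follows. This is an illustrative one-sided version that caps large values at an empirical quantile; the function name `winsorize_upper`, the default quantile, and the nearest-rank quantile definition are all assumptions of this sketch and are not taken from the LinDA software or the paper.

```python
import math


def winsorize_upper(values, upper_quantile=0.97):
    """Cap values above the empirical upper quantile.

    Uses the nearest-rank definition of the quantile: the cap is the
    value at position ceil(q * n) in the sorted data (1-based).
    """
    if not values:
        return []
    ordered = sorted(values)
    rank = max(1, math.ceil(upper_quantile * len(ordered)))
    cap = ordered[rank - 1]
    # Values at or below the cap are unchanged; larger ones are pulled
    # down to the cap instead of being discarded.
    return [min(v, cap) for v in values]
```

Unlike trimming, this keeps the sample size fixed, so an extreme count still contributes to the fit but with bounded influence.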
Full Text Available
Benchmarking differential abundance analysis methods for correlated microbiome sequencing data

https://doi.org/10.1093/bib/bbac607

Yang, Lu; Chen, Jun (January 2023, Briefings in Bioinformatics)

Abstract Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.
Full Text Available
A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions

https://doi.org/10.1186/s40168-022-01320-0

Yang, Lu; Chen, Jun (December 2022, Microbiome)

Abstract Background Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade addressing the special characteristics of microbiome data such as zero inflation and compositional effects. Disturbingly, different DAA tools could sometimes produce quite discordant results, opening to the possibility of cherry-picking the tool in favor of one’s own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation, which covers as many biologically relevant scenarios as possible, is critically needed. Results We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects such as ANCOM-BC, Aldex2, metagenomeSeq (fitFeatureModel), and DACOMP did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power has been observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strength of the existing DAA methods. We show that ZicoSeq generally controlled for false positives across settings, and the power was among the highest. Application of DAA methods to a large collection of real datasets revealed a similar pattern observed in simulation studies. 
Conclusions Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery.
Full Text Available